This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.

No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.

Question 1

Section 1

Import Libraries

# install.packages('Hmisc')
# install.packages("ggpubr")
library(tidyverse)
library(emmeans) # for emmeans(), contrast(), pairs()
library(gridExtra)
library(knitr)
library(kableExtra)
library(Hmisc) # for correlation functions
library(ggpubr) # for ggarrange
options(width=100)

Data Preparation

# Read the data, converting character columns to factors
food_df <- read.csv("2019-20-enforcement-data-food-hygiene.csv", stringsAsFactors = TRUE)

# Structure
#str(food_df)

# Summary
#summary(food_df)

# Select the variables needed and create a new dataframe
new_food_df <- food_df[, c(1:6, 19:24, 36)]

# Create a new column: establishments not yet rated and those outside the
# programme are not part of the analysis, so subtract them from the total.
new_food_df <- new_food_df %>%
  mutate(EstratedforIntv = Totalestablishments.includingnotyetrated.outside. -
           Establishmentsnotyetratedforintervention -
           Establishmentsoutsidetheprogramme)

str(new_food_df)
## 'data.frame':    353 obs. of  14 variables:
##  $ Country                                          : Factor w/ 3 levels "England","Northern Ireland",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LAType                                           : Factor w/ 6 levels "District Council",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LAName                                           : Factor w/ 353 levels "Adur and Worthing ",..: 1 2 3 8 9 10 11 12 16 17 ...
##  $ Totalestablishments.includingnotyetrated.outside.: int  1478 1316 1112 1208 905 1132 1746 2002 574 1400 ...
##  $ Establishmentsnotyetratedforintervention         : int  24 29 1 44 26 0 58 40 41 84 ...
##  $ Establishmentsoutsidetheprogramme                : int  0 74 0 1 1 0 214 39 0 42 ...
##  $ Total.ofInterventionsachieved.premisesratedA.E.  : num  96.1 90.6 88.9 94 80.7 ...
##  $ Total.ofInterventionsachieved.premisesratedA     : Factor w/ 30 levels "","100","33.33",..: 2 2 2 2 5 8 2 2 2 2 ...
##  $ Total.ofInterventionsachieved.premisesratedB     : num  100 98.3 95.1 96.3 100 ...
##  $ Total.ofInterventionsachieved.premisesratedC     : num  95.5 89.7 97 94.4 78.8 ...
##  $ Total.ofInterventionsachieved.premisesratedD     : num  96 93 91.8 92.6 85.3 ...
##  $ Total.ofInterventionsachieved.premisesratedE     : num  94 85.1 72.3 95.5 68.3 ...
##  $ ProfessionalFullTimeEquivalentPosts.occupied..   : num  5 4 3.5 4 2 4.65 2.5 5 2 4.2 ...
##  $ EstratedforIntv                                  : int  1454 1213 1111 1163 878 1132 1474 1923 533 1274 ...
summary(new_food_df)
##              Country                             LAType                        LAName   
##  England         :320   District Council            :194   Adur and Worthing      :  1  
##  Northern Ireland: 11   London Borough              : 33   Allerdale              :  1  
##  Wales           : 22   Metropolitan Borough Council: 37   Amber Valley           :  1  
##                         NI Unitary Authority        : 11   Anglesey               :  1  
##                         Unitary Authority           : 56   Antrim and Newtownabbey:  1  
##                         Welsh Unitary Authority     : 22   Ards and North Down    :  1  
##                                                            (Other)                :347  
##  Totalestablishments.includingnotyetrated.outside. Establishmentsnotyetratedforintervention
##  Min.   : 145.0                                    Min.   :   0.00                         
##  1st Qu.: 920.5                                    1st Qu.:  25.00                         
##  Median :1330.0                                    Median :  49.00                         
##  Mean   :1620.7                                    Mean   :  89.75                         
##  3rd Qu.:2004.5                                    3rd Qu.: 100.00                         
##  Max.   :9277.0                                    Max.   :1744.00                         
##  NA's   :6                                         NA's   :6                               
##  Establishmentsoutsidetheprogramme Total.ofInterventionsachieved.premisesratedA.E.
##  Min.   :  0.00                    Min.   : 20.64                                 
##  1st Qu.:  0.00                    1st Qu.: 81.81                                 
##  Median :  2.00                    Median : 90.82                                 
##  Mean   : 49.62                    Mean   : 86.62                                 
##  3rd Qu.: 39.00                    3rd Qu.: 95.39                                 
##  Max.   :865.00                    Max.   :100.00                                 
##  NA's   :6                         NA's   :6                                      
##  Total.ofInterventionsachieved.premisesratedA Total.ofInterventionsachieved.premisesratedB
##  100    :282                                  Min.   : 50.00                              
##  NR     : 24                                  1st Qu.: 93.53                              
##         :  6                                  Median : 97.75                              
##  50     :  3                                  Mean   : 95.25                              
##  80     :  3                                  3rd Qu.:100.00                              
##  88.89  :  3                                  Max.   :100.00                              
##  (Other): 32                                  NA's   :6                                   
##  Total.ofInterventionsachieved.premisesratedC Total.ofInterventionsachieved.premisesratedD
##  Min.   : 18.37                               Min.   : 19.77                              
##  1st Qu.: 89.27                               1st Qu.: 82.00                              
##  Median : 94.97                               Median : 91.72                              
##  Mean   : 91.84                               Mean   : 86.32                              
##  3rd Qu.: 97.77                               3rd Qu.: 95.98                              
##  Max.   :100.00                               Max.   :100.00                              
##  NA's   :6                                    NA's   :6                                   
##  Total.ofInterventionsachieved.premisesratedE ProfessionalFullTimeEquivalentPosts.occupied..
##  Min.   :  1.81                               Min.   : 0.65                                 
##  1st Qu.: 65.39                               1st Qu.: 2.50                                 
##  Median : 87.50                               Median : 3.41                                 
##  Mean   : 77.37                               Mean   : 4.10                                 
##  3rd Qu.: 95.90                               3rd Qu.: 5.00                                 
##  Max.   :100.00                               Max.   :22.13                                 
##  NA's   :6                                    NA's   :6                                     
##  EstratedforIntv 
##  Min.   : 144.0  
##  1st Qu.: 878.5  
##  Median :1225.0  
##  Mean   :1481.3  
##  3rd Qu.:1788.0  
##  Max.   :8541.0  
##  NA's   :6

Create a plot that clearly shows the distribution across the Local Authorities (LAs) of the % of enforcement actions successfully achieved

# Convert from factor to numeric; non-numeric labels such as "NR" and blanks are coerced to NA
new_food_df$Total.ofInterventionsachieved.premisesratedA <-
  as.numeric(as.character(new_food_df$Total.ofInterventionsachieved.premisesratedA))
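As a sanity check on that conversion: `as.numeric()` applied directly to a factor returns the underlying level codes, not the printed values, so the `as.character()` step matters. A minimal illustration with a hypothetical factor:

```r
f <- factor(c("100", "50", "NR", ""))

# Correct: go via character; "NR" and "" become NA (with a coercion warning)
as.numeric(as.character(f))

# Incorrect: returns the factor's internal level codes, not the values
as.numeric(f)
```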

# Distribution across the LAs
ggplot(new_food_df, aes(x = Total.ofInterventionsachieved.premisesratedA.E.)) +
  geom_histogram(binwidth = 1, color = '#9DBEBB') +
  geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedA.E., na.rm = TRUE),
             colour = "red") +
  ggtitle("The Distribution of % of enforcement actions successfully achieved") +
  labs(x = "Proportion of Interventions Achieved (overall)") +
  theme(text = element_text(size = 12))

# Overall distribution across LAs by "Country"
ggplot(new_food_df) +
  geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedA.E., fill = Country),
                 alpha = 0.7, color = 'black', binwidth = 2.5) +
  geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedA.E., na.rm = TRUE),
             colour = "#474647") +
  ggtitle("The Distribution of % of enforcement actions successfully achieved") +
  labs(x = "The % of Interventions achieved") +
  theme(text = element_text(size = 12))

# Respective distribution across LAs by "Country"
ggarrange(
  estratedA.country <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedA, fill = Country),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedA, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("(A) The Distribution of % of enforcement\nactions successfully achieved") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  estratedB.country <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedB, fill = Country),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedB, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("(B) The Distribution of % of enforcement\nactions successfully achieved") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  estratedC.country <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedC, fill = Country),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedC, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("(C) The Distribution of % of enforcement\nactions successfully achieved") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  estratedD.country <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedD, fill = Country),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedD, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("(D) The Distribution of % of enforcement\nactions successfully achieved") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  estratedE.country <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedE, fill = Country),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedE, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("(E) The Distribution of % of enforcement\nactions successfully achieved") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  common.legend = TRUE)

# Overall distribution across LAs by "LATypes"
ggplot(new_food_df) +
  geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedA.E., fill = LAType),
                 alpha = 0.7, color = 'black', binwidth = 2.5) +
  geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedA.E., na.rm = TRUE),
             colour = "#474647") +
  ggtitle("The Distribution of % of enforcement actions successfully achieved") +
  labs(x = "The % of Interventions achieved")

# Respective distribution across LAs by "LATypes"
ggarrange(
  estratedA.latype <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedA, fill = LAType),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedA, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("The Distribution of % of enforcement\nactions successfully achieved rated A") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  estratedB.latype <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedB, fill = LAType),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedB, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("The Distribution of % of enforcement\nactions successfully achieved rated B") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  estratedC.latype <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedC, fill = LAType),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedC, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("The Distribution of % of enforcement\nactions successfully achieved rated C") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  estratedD.latype <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedD, fill = LAType),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedD, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("The Distribution of % of enforcement\nactions successfully achieved rated D") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  estratedE.latype <- ggplot(new_food_df) +
    geom_histogram(aes(x = Total.ofInterventionsachieved.premisesratedE, fill = LAType),
                   alpha = 0.7, color = 'black', binwidth = 2.5) +
    geom_vline(xintercept = mean(new_food_df$Total.ofInterventionsachieved.premisesratedE, na.rm = TRUE),
               colour = "#474647") +
    ggtitle("The Distribution of % of enforcement\nactions successfully achieved rated E") +
    labs(x = "The % of Interventions achieved") +
    theme(text = element_text(size = 6)),

  common.legend = TRUE)

Examine whether employing more professional enforcement officers increases the likelihood of establishments successfully responding to enforcement actions: is there a relationship between the proportion of successful responses and the number of FTE food safety employees in each local authority?

# Plot the regression line 
ggplot(new_food_df, aes(y = Total.ofInterventionsachieved.premisesratedA.E.,
                        x = ProfessionalFullTimeEquivalentPosts.occupied..)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = lm) +
  labs(x = "FTE food safety employees", y = "Proportion of successful responses",
       title = expression(r == -0.02),
       subtitle = "Each point is a local authority. The shaded area shows the 95% CI for the best-fitting regression line")
## `geom_smooth()` using formula = 'y ~ x'

ggarrange(
  # A
  scresp.by.emply.A <- ggplot(new_food_df, aes(y = Total.ofInterventionsachieved.premisesratedA,
                                               x = ProfessionalFullTimeEquivalentPosts.occupied..)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = lm) +
    labs(x = "FTE food safety employees", y = "Proportion of successful responses",
         title = expression(r == 0.05),
         subtitle = "Each point is a local authority. The shaded area\nshows the 95% CI for the best-fitting regression\nline") +
    theme(text = element_text(size = 6)),

  # B
  scresp.by.emply.B <- ggplot(new_food_df, aes(y = Total.ofInterventionsachieved.premisesratedB,
                                               x = ProfessionalFullTimeEquivalentPosts.occupied..)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = lm) +
    labs(x = "FTE food safety employees", y = "Proportion of successful responses",
         title = expression(r == 0.1),
         subtitle = "Each point is a local authority. The shaded area\nshows the 95% CI for the best-fitting regression\nline") +
    theme(text = element_text(size = 6)),

  # C
  scresp.by.emply.C <- ggplot(new_food_df, aes(y = Total.ofInterventionsachieved.premisesratedC,
                                               x = ProfessionalFullTimeEquivalentPosts.occupied..)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = lm) +
    labs(x = "FTE food safety employees", y = "Proportion of successful responses",
         title = expression(r == -0.01),
         subtitle = "Each point is a local authority. The shaded area\nshows the 95% CI for the best-fitting regression\nline") +
    theme(text = element_text(size = 6)),

  # D
  scresp.by.emply.D <- ggplot(new_food_df, aes(y = Total.ofInterventionsachieved.premisesratedD,
                                               x = ProfessionalFullTimeEquivalentPosts.occupied..)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = lm) +
    labs(x = "FTE food safety employees", y = "Proportion of successful responses",
         title = expression(r == -0.19),
         subtitle = "Each point is a local authority. The shaded area\nshows the 95% CI for the best-fitting regression\nline") +
    theme(text = element_text(size = 6)),

  # E
  scresp.by.emply.E <- ggplot(new_food_df, aes(y = Total.ofInterventionsachieved.premisesratedE,
                                               x = ProfessionalFullTimeEquivalentPosts.occupied..)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = lm) +
    labs(x = "FTE food safety employees", y = "Proportion of successful responses",
         title = expression(r == 0.03),
         subtitle = "Each point is a local authority. The shaded area\nshows the 95% CI for the best-fitting regression line") +
    theme(text = element_text(size = 6)),

  common.legend = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# Calculate the correlation coefficients
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedA.E., ProfessionalFullTimeEquivalentPosts.occupied..)))
##                                                 Total.ofInterventionsachieved.premisesratedA.E.
## Total.ofInterventionsachieved.premisesratedA.E.                                            1.00
## ProfessionalFullTimeEquivalentPosts.occupied..                                            -0.02
##                                                 ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedA.E.                                          -0.02
## ProfessionalFullTimeEquivalentPosts.occupied..                                            1.00
## 
## n= 347 
## 
## 
## P
##                                                 Total.ofInterventionsachieved.premisesratedA.E.
## Total.ofInterventionsachieved.premisesratedA.E.                                                
## ProfessionalFullTimeEquivalentPosts.occupied..  0.6552                                         
##                                                 ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedA.E. 0.6552                                        
## ProfessionalFullTimeEquivalentPosts.occupied..
#A
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedA, ProfessionalFullTimeEquivalentPosts.occupied..)))
##                                                Total.ofInterventionsachieved.premisesratedA
## Total.ofInterventionsachieved.premisesratedA                                           1.00
## ProfessionalFullTimeEquivalentPosts.occupied..                                         0.05
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedA                                             0.05
## ProfessionalFullTimeEquivalentPosts.occupied..                                           1.00
## 
## n
##                                                Total.ofInterventionsachieved.premisesratedA
## Total.ofInterventionsachieved.premisesratedA                                            323
## ProfessionalFullTimeEquivalentPosts.occupied..                                          323
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedA                                              323
## ProfessionalFullTimeEquivalentPosts.occupied..                                            347
## 
## P
##                                                Total.ofInterventionsachieved.premisesratedA
## Total.ofInterventionsachieved.premisesratedA                                               
## ProfessionalFullTimeEquivalentPosts.occupied.. 0.3689                                      
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedA   0.3689                                        
## ProfessionalFullTimeEquivalentPosts.occupied..
#B
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedB, ProfessionalFullTimeEquivalentPosts.occupied..)))
##                                                Total.ofInterventionsachieved.premisesratedB
## Total.ofInterventionsachieved.premisesratedB                                            1.0
## ProfessionalFullTimeEquivalentPosts.occupied..                                          0.1
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedB                                              0.1
## ProfessionalFullTimeEquivalentPosts.occupied..                                            1.0
## 
## n= 347 
## 
## 
## P
##                                                Total.ofInterventionsachieved.premisesratedB
## Total.ofInterventionsachieved.premisesratedB                                               
## ProfessionalFullTimeEquivalentPosts.occupied.. 0.065                                       
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedB   0.065                                         
## ProfessionalFullTimeEquivalentPosts.occupied..
#C
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedC, ProfessionalFullTimeEquivalentPosts.occupied..)))
##                                                Total.ofInterventionsachieved.premisesratedC
## Total.ofInterventionsachieved.premisesratedC                                           1.00
## ProfessionalFullTimeEquivalentPosts.occupied..                                        -0.01
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedC                                            -0.01
## ProfessionalFullTimeEquivalentPosts.occupied..                                           1.00
## 
## n= 347 
## 
## 
## P
##                                                Total.ofInterventionsachieved.premisesratedC
## Total.ofInterventionsachieved.premisesratedC                                               
## ProfessionalFullTimeEquivalentPosts.occupied.. 0.9048                                      
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedC   0.9048                                        
## ProfessionalFullTimeEquivalentPosts.occupied..
#D
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedD, ProfessionalFullTimeEquivalentPosts.occupied..)))
##                                                Total.ofInterventionsachieved.premisesratedD
## Total.ofInterventionsachieved.premisesratedD                                           1.00
## ProfessionalFullTimeEquivalentPosts.occupied..                                        -0.19
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedD                                            -0.19
## ProfessionalFullTimeEquivalentPosts.occupied..                                           1.00
## 
## n= 347 
## 
## 
## P
##                                                Total.ofInterventionsachieved.premisesratedD
## Total.ofInterventionsachieved.premisesratedD                                               
## ProfessionalFullTimeEquivalentPosts.occupied.. 3e-04                                       
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedD   3e-04                                         
## ProfessionalFullTimeEquivalentPosts.occupied..
#E
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedE, ProfessionalFullTimeEquivalentPosts.occupied..)))
##                                                Total.ofInterventionsachieved.premisesratedE
## Total.ofInterventionsachieved.premisesratedE                                           1.00
## ProfessionalFullTimeEquivalentPosts.occupied..                                         0.03
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedE                                             0.03
## ProfessionalFullTimeEquivalentPosts.occupied..                                           1.00
## 
## n= 347 
## 
## 
## P
##                                                Total.ofInterventionsachieved.premisesratedE
## Total.ofInterventionsachieved.premisesratedE                                               
## ProfessionalFullTimeEquivalentPosts.occupied.. 0.556                                       
##                                                ProfessionalFullTimeEquivalentPosts.occupied..
## Total.ofInterventionsachieved.premisesratedE   0.556                                         
## ProfessionalFullTimeEquivalentPosts.occupied..
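As an aside, the five pairwise `rcorr()` calls above could be collapsed into a single call on a matrix of all the columns, since both `rcorr()` and base `cor()` handle missing values by pairwise deletion (which is why the `n` matrix for band A differs across cells). A minimal base-R illustration on synthetic data:

```r
set.seed(1)
# Toy matrix: two outcome columns and one predictor, with a missing value
m <- cbind(a = rnorm(20), b = rnorm(20), fte = rnorm(20))
m[3, "a"] <- NA

# One call yields every pairwise correlation, dropping NAs pair by pair
cor(m, use = "pairwise.complete.obs")["fte", c("a", "b")]
```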
#All
m.scresp.by.emply <- lm(Total.ofInterventionsachieved.premisesratedA.E. ~ ProfessionalFullTimeEquivalentPosts.occupied.., data=new_food_df)
summary(m.scresp.by.emply)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedA.E. ~ 
##     ProfessionalFullTimeEquivalentPosts.occupied.., data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.304  -4.575   4.067   8.658  13.860 
## 
## Coefficients:
##                                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                     87.1091     1.2828  67.905   <2e-16 ***
## ProfessionalFullTimeEquivalentPosts.occupied..  -0.1195     0.2675  -0.447    0.655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.4 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.0005787,  Adjusted R-squared:  -0.002318 
## F-statistic: 0.1998 on 1 and 345 DF,  p-value: 0.6552
cbind(coefficient=coef(m.scresp.by.emply), confint(m.scresp.by.emply))
##                                                coefficient      2.5 %     97.5 %
## (Intercept)                                     87.1091495 84.5860343 89.6322647
## ProfessionalFullTimeEquivalentPosts.occupied..  -0.1195469 -0.6456029  0.4065092
#A
m.A.scresp.by.emply <- lm(Total.ofInterventionsachieved.premisesratedA ~ ProfessionalFullTimeEquivalentPosts.occupied.., data=new_food_df)
summary(m.A.scresp.by.emply)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedA ~ ProfessionalFullTimeEquivalentPosts.occupied.., 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.508   1.657   1.954   2.132   2.452 
## 
## Coefficients:
##                                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                     97.4513     0.8118   120.0   <2e-16 ***
## ProfessionalFullTimeEquivalentPosts.occupied..   0.1486     0.1652     0.9    0.369    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.504 on 321 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.002516,   Adjusted R-squared:  -0.0005913 
## F-statistic: 0.8097 on 1 and 321 DF,  p-value: 0.3689
cbind(coefficient=coef(m.A.scresp.by.emply), confint(m.A.scresp.by.emply))
##                                                coefficient     2.5 %     97.5 %
## (Intercept)                                     97.4513405 95.854263 99.0484176
## ProfessionalFullTimeEquivalentPosts.occupied..   0.1486341 -0.176336  0.4736041
#B
m.B.scresp.by.emply <- lm(Total.ofInterventionsachieved.premisesratedB ~ ProfessionalFullTimeEquivalentPosts.occupied.., data=new_food_df)
summary(m.B.scresp.by.emply)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedB ~ ProfessionalFullTimeEquivalentPosts.occupied.., 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.815  -1.805   2.418   4.533   5.593 
## 
## Coefficients:
##                                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                     94.1355     0.7040 133.711   <2e-16 ***
## ProfessionalFullTimeEquivalentPosts.occupied..   0.2717     0.1468   1.851    0.065 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.804 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.009833,   Adjusted R-squared:  0.006963 
## F-statistic: 3.426 on 1 and 345 DF,  p-value: 0.06503
cbind(coefficient=coef(m.B.scresp.by.emply), confint(m.B.scresp.by.emply))
##                                                coefficient      2.5 %     97.5 %
## (Intercept)                                     94.1355388 92.7508225 95.5202552
## ProfessionalFullTimeEquivalentPosts.occupied..   0.2716966 -0.0170094  0.5604025
#C
m.C.scresp.by.emply <- lm(Total.ofInterventionsachieved.premisesratedC ~ ProfessionalFullTimeEquivalentPosts.occupied.., data=new_food_df)
summary(m.C.scresp.by.emply)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedC ~ ProfessionalFullTimeEquivalentPosts.occupied.., 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.538  -2.617   3.183   5.963   8.168 
## 
## Coefficients:
##                                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                    91.94143    0.96000   95.77   <2e-16 ***
## ProfessionalFullTimeEquivalentPosts.occupied.. -0.02396    0.20016   -0.12    0.905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.277 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  4.152e-05,  Adjusted R-squared:  -0.002857 
## F-statistic: 0.01433 on 1 and 345 DF,  p-value: 0.9048
cbind(coefficient=coef(m.C.scresp.by.emply), confint(m.C.scresp.by.emply))
##                                                coefficient      2.5 %     97.5 %
## (Intercept)                                    91.94143089 90.0532383 93.8296235
## ProfessionalFullTimeEquivalentPosts.occupied.. -0.02395684 -0.4176349  0.3697212
#D
m.D.scresp.by.emply <- lm(Total.ofInterventionsachieved.premisesratedD ~ ProfessionalFullTimeEquivalentPosts.occupied.., data=new_food_df)
summary(m.D.scresp.by.emply)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedD ~ ProfessionalFullTimeEquivalentPosts.occupied.., 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.581  -4.915   4.727   9.449  21.225 
## 
## Coefficients:
##                                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                      90.887      1.468  61.923  < 2e-16 ***
## ProfessionalFullTimeEquivalentPosts.occupied..   -1.113      0.306  -3.638 0.000317 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.18 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.03694,    Adjusted R-squared:  0.03415 
## F-statistic: 13.23 on 1 and 345 DF,  p-value: 0.0003171
cbind(coefficient=coef(m.D.scresp.by.emply), confint(m.D.scresp.by.emply))
##                                                coefficient     2.5 %     97.5 %
## (Intercept)                                      90.887020 88.000153 93.7738874
## ProfessionalFullTimeEquivalentPosts.occupied..   -1.113208 -1.715105 -0.5113118
#E
m.E.scresp.by.emply <- lm(Total.ofInterventionsachieved.premisesratedE ~ ProfessionalFullTimeEquivalentPosts.occupied.., data=new_food_df)
summary(m.E.scresp.by.emply)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedE ~ ProfessionalFullTimeEquivalentPosts.occupied.., 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.724 -12.405   9.929  18.679  23.582 
## 
## Coefficients:
##                                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                     76.1117     2.4930  30.530   <2e-16 ***
## ProfessionalFullTimeEquivalentPosts.occupied..   0.3063     0.5198   0.589    0.556    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.09 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.001006,   Adjusted R-squared:  -0.00189 
## F-statistic: 0.3474 on 1 and 345 DF,  p-value: 0.556
cbind(coefficient=coef(m.E.scresp.by.emply), confint(m.E.scresp.by.emply))
##                                                coefficient      2.5 %    97.5 %
## (Intercept)                                     76.1117279 71.2082531 81.015203
## ProfessionalFullTimeEquivalentPosts.occupied..   0.3063475 -0.7160008  1.328696

Examine whether there is a relationship between the proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority – essentially creating our own measure of how many food safety employees there are per establishment.

# Create a new variable
new_food_df <- new_food_df %>% mutate(Propofemplyest = ProfessionalFullTimeEquivalentPosts.occupied.. / EstratedforIntv) 

#Overall
ggplot(new_food_df, aes(y=Total.ofInterventionsachieved.premisesratedA.E., x=Propofemplyest)) + geom_point() + labs(x="The number of employees as a proportion of the number of establishments ", y="Proportion of successful responses",title = expression(r==0.21), subtitle="Each point is a local authority. The shaded area shows the 95% CI for the best-fitting regression line") + geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedA.E., Propofemplyest)))
##                                                 Total.ofInterventionsachieved.premisesratedA.E.
## Total.ofInterventionsachieved.premisesratedA.E.                                            1.00
## Propofemplyest                                                                             0.21
##                                                 Propofemplyest
## Total.ofInterventionsachieved.premisesratedA.E.           0.21
## Propofemplyest                                            1.00
## 
## n= 347 
## 
## 
## P
##                                                 Total.ofInterventionsachieved.premisesratedA.E.
## Total.ofInterventionsachieved.premisesratedA.E.                                                
## Propofemplyest                                  1e-04                                          
##                                                 Propofemplyest
## Total.ofInterventionsachieved.premisesratedA.E. 1e-04         
## Propofemplyest
m.scresp.by.propemplyest <- lm(Total.ofInterventionsachieved.premisesratedA.E. ~ Propofemplyest, data=new_food_df)
summary(m.scresp.by.propemplyest)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedA.E. ~ 
##     Propofemplyest, data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.181  -5.341   4.326   8.321  15.998 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      79.536      1.929  41.233  < 2e-16 ***
## Propofemplyest 2407.398    617.062   3.901 0.000115 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.14 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.04225,    Adjusted R-squared:  0.03948 
## F-statistic: 15.22 on 1 and 345 DF,  p-value: 0.000115
cbind(coefficient=coef(m.scresp.by.propemplyest), confint(m.scresp.by.propemplyest))
##                coefficient      2.5 %    97.5 %
## (Intercept)       79.53564   75.74169   83.3296
## Propofemplyest  2407.39810 1193.72035 3621.0758
ggarrange(
#A
scresp.by.propemlyest.A <- ggplot(new_food_df, aes(y=Total.ofInterventionsachieved.premisesratedA, x=Propofemplyest)) + geom_point() + labs(x="Employees per establishment", y="Proportion of successful responses",title = expression(r==-0.08), subtitle="Each point is a local authority. The shaded area \n shows the 95% CI for the best-fitting regression \n line") + geom_smooth(method=lm) + theme(text=element_text(size=6)),

#B
scresp.by.propemlyest.B <- ggplot(new_food_df, aes(y=Total.ofInterventionsachieved.premisesratedB, x=Propofemplyest)) + geom_point() + labs(x="Employees per establishment", y="Proportion of successful responses",title = expression(r==0.02), subtitle="Each point is a local authority. The shaded area \n shows the 95% CI for the best-fitting regression \n line") + geom_smooth(method=lm) + theme(text=element_text(size=6)),

#C
scresp.by.propemlyest.C <- ggplot(new_food_df, aes(y=Total.ofInterventionsachieved.premisesratedC, x=Propofemplyest)) + geom_point() + labs(x="Employees per establishment", y="Proportion of successful responses",title = expression(r==0.13), subtitle="Each point is a local authority. The shaded area \n shows the 95% CI for the best-fitting regression \n line") + geom_smooth(method=lm) + theme(text=element_text(size=6)),

#D
scresp.by.propemlyest.D <- ggplot(new_food_df, aes(y=Total.ofInterventionsachieved.premisesratedD, x=Propofemplyest)) + geom_point() + labs(x="Employees per establishment", y="Proportion of successful responses",title = expression(r==0.13), subtitle="Each point is a local authority. The shaded area \n shows the 95% CI for the best-fitting regression \n line") + geom_smooth(method=lm) + theme(text=element_text(size=6)),

#E
scresp.by.propemlyest.E <- ggplot(new_food_df, aes(y=Total.ofInterventionsachieved.premisesratedE, x=Propofemplyest)) + geom_point() + labs(x="Employees per establishment", y="Proportion of successful responses", title = expression(r==0.2), subtitle="Each point is a local authority. The shaded area shows the 95% CI for the best-fitting regression line") + geom_smooth(method=lm) + theme(text=element_text(size=6)))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#A
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedA, Propofemplyest)))
##                                              Total.ofInterventionsachieved.premisesratedA
## Total.ofInterventionsachieved.premisesratedA                                         1.00
## Propofemplyest                                                                      -0.08
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedA          -0.08
## Propofemplyest                                         1.00
## 
## n
##                                              Total.ofInterventionsachieved.premisesratedA
## Total.ofInterventionsachieved.premisesratedA                                          323
## Propofemplyest                                                                        323
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedA            323
## Propofemplyest                                          347
## 
## P
##                                              Total.ofInterventionsachieved.premisesratedA
## Total.ofInterventionsachieved.premisesratedA                                             
## Propofemplyest                               0.1665                                      
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedA 0.1665        
## Propofemplyest
m.A.scresp.by.propemplyest <- lm(Total.ofInterventionsachieved.premisesratedA ~ Propofemplyest, data=new_food_df)
summary(m.A.scresp.by.propemplyest)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedA ~ Propofemplyest, 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.715   1.390   1.737   2.129   5.558 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      99.652      1.209  82.411   <2e-16 ***
## Propofemplyest -533.295    384.617  -1.387    0.167    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.491 on 321 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.005954,   Adjusted R-squared:  0.002857 
## F-statistic: 1.923 on 1 and 321 DF,  p-value: 0.1665
cbind(coefficient=coef(m.A.scresp.by.propemplyest), confint(m.A.scresp.by.propemplyest))
##                coefficient       2.5 %   97.5 %
## (Intercept)       99.65163    97.27268 102.0306
## Propofemplyest  -533.29480 -1289.98324 223.3936
#B
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedB, Propofemplyest)))
##                                              Total.ofInterventionsachieved.premisesratedB
## Total.ofInterventionsachieved.premisesratedB                                         1.00
## Propofemplyest                                                                       0.02
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedB           0.02
## Propofemplyest                                         1.00
## 
## n= 347 
## 
## 
## P
##                                              Total.ofInterventionsachieved.premisesratedB
## Total.ofInterventionsachieved.premisesratedB                                             
## Propofemplyest                               0.7116                                      
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedB 0.7116        
## Propofemplyest
m.B.scresp.by.propemplyest <- lm(Total.ofInterventionsachieved.premisesratedB ~ Propofemplyest, data=new_food_df)
summary(m.B.scresp.by.propemplyest)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedB ~ Propofemplyest, 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.280  -1.651   2.577   4.614   4.951 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      94.871      1.087   87.31   <2e-16 ***
## Propofemplyest  128.593    347.586    0.37    0.712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.836 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.0003966,  Adjusted R-squared:  -0.002501 
## F-statistic: 0.1369 on 1 and 345 DF,  p-value: 0.7116
cbind(coefficient=coef(m.B.scresp.by.propemplyest), confint(m.B.scresp.by.propemplyest))
##                coefficient      2.5 %    97.5 %
## (Intercept)       94.87123   92.73414  97.00833
## Propofemplyest   128.59335 -555.06003 812.24674
#C
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedC, Propofemplyest)))
##                                              Total.ofInterventionsachieved.premisesratedC
## Total.ofInterventionsachieved.premisesratedC                                         1.00
## Propofemplyest                                                                       0.13
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedC           0.13
## Propofemplyest                                         1.00
## 
## n= 347 
## 
## 
## P
##                                              Total.ofInterventionsachieved.premisesratedC
## Total.ofInterventionsachieved.premisesratedC                                             
## Propofemplyest                               0.0128                                      
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedC 0.0128        
## Propofemplyest
m.C.scresp.by.propemplyest <- lm(Total.ofInterventionsachieved.premisesratedC ~ Propofemplyest, data=new_food_df)
summary(m.C.scresp.by.propemplyest)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedC ~ Propofemplyest, 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.142  -2.717   3.234   5.867  10.093 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      88.402      1.461  60.490   <2e-16 ***
## Propofemplyest 1169.563    467.512   2.502   0.0128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.194 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.01782,    Adjusted R-squared:  0.01497 
## F-statistic: 6.258 on 1 and 345 DF,  p-value: 0.01282
cbind(coefficient=coef(m.C.scresp.by.propemplyest), confint(m.C.scresp.by.propemplyest))
##                coefficient     2.5 %     97.5 %
## (Intercept)       88.40198  85.52752   91.27643
## Propofemplyest  1169.56330 250.03105 2089.09555
#D
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedD, Propofemplyest)))
##                                              Total.ofInterventionsachieved.premisesratedD
## Total.ofInterventionsachieved.premisesratedD                                         1.00
## Propofemplyest                                                                       0.13
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedD           0.13
## Propofemplyest                                         1.00
## 
## n= 347 
## 
## 
## P
##                                              Total.ofInterventionsachieved.premisesratedD
## Total.ofInterventionsachieved.premisesratedD                                             
## Propofemplyest                               0.0177                                      
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedD 0.0177        
## Propofemplyest
m.D.scresp.by.propemplyest <- lm(Total.ofInterventionsachieved.premisesratedD ~ Propofemplyest, data=new_food_df)
summary(m.D.scresp.by.propemplyest)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedD ~ Propofemplyest, 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.709  -4.475   5.014   9.216  15.590 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      81.213      2.279  35.640   <2e-16 ***
## Propofemplyest 1736.630    728.950   2.382   0.0177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.34 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.01619,    Adjusted R-squared:  0.01333 
## F-statistic: 5.676 on 1 and 345 DF,  p-value: 0.01774
cbind(coefficient=coef(m.D.scresp.by.propemplyest), confint(m.D.scresp.by.propemplyest))
##                coefficient     2.5 %     97.5 %
## (Intercept)       81.21274  76.73085   85.69463
## Propofemplyest  1736.63029 302.88430 3170.37628
#E
rcorr(as.matrix(select(new_food_df, Total.ofInterventionsachieved.premisesratedE, Propofemplyest)))
##                                              Total.ofInterventionsachieved.premisesratedE
## Total.ofInterventionsachieved.premisesratedE                                          1.0
## Propofemplyest                                                                        0.2
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedE            0.2
## Propofemplyest                                          1.0
## 
## n= 347 
## 
## 
## P
##                                              Total.ofInterventionsachieved.premisesratedE
## Total.ofInterventionsachieved.premisesratedE                                             
## Propofemplyest                               3e-04                                       
##                                              Propofemplyest
## Total.ofInterventionsachieved.premisesratedE 3e-04         
## Propofemplyest
m.E.scresp.by.propemplyest <- lm(Total.ofInterventionsachieved.premisesratedE ~ Propofemplyest, data=new_food_df)
summary(m.E.scresp.by.propemplyest)
## 
## Call:
## lm(formula = Total.ofInterventionsachieved.premisesratedE ~ Propofemplyest, 
##     data = new_food_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.081 -12.719   8.185  17.613  29.161 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      64.291      3.758  17.109  < 2e-16 ***
## Propofemplyest 4444.401   1202.058   3.697 0.000253 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.64 on 345 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.03811,    Adjusted R-squared:  0.03533 
## F-statistic: 13.67 on 1 and 345 DF,  p-value: 0.0002534
cbind(coefficient=coef(m.E.scresp.by.propemplyest), confint(m.E.scresp.by.propemplyest))
##                coefficient     2.5 %    97.5 %
## (Intercept)       64.29105   56.9003   71.6818
## Propofemplyest  4444.40073 2080.1158 6808.6857

Section 2

This report presents the results of the analyses requested by the Food Standards Agency, using the data provided for 353 local authorities.

First, we examine the distribution across the Local Authorities (LAs) of the percentage of enforcement actions successfully achieved. The histogram is left-skewed, with an average of 86.62% of enforcement actions successfully achieved, so most enforcement actions are clearly successful.

The following graph shows the overall distribution across LAs by Country. Most LAs are in England, and Northern Ireland's actions are largely successfully achieved.

The following graph shows the distributions for establishments rated A to E separately, across LAs by Country. All are left-skewed histograms, with shapes similar to the overall (A to E) distribution.

The following graph shows the overall distribution across LAs by LA type. District Councils are the most common type, and Welsh Unitary Authorities contribute the fewest cases.

The following graph shows the corresponding distributions for establishments rated A to E across LAs by LA type.

Next, we plot the regression line to examine the relationship between FTE food safety employees and the proportion of successful responses.

## `geom_smooth()` using formula = 'y ~ x'

We then examine the relationship between FTE food safety employees and the proportion of successful responses separately for establishments rated A to E.

ggarrange(scresp.by.propemlyest.A, scresp.by.propemlyest.B, scresp.by.propemlyest.C, scresp.by.propemlyest.D, scresp.by.propemlyest.E)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 30 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 30 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Removed 6 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Removed 6 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Removed 6 rows containing missing values (`geom_point()`).

Overall (A to E): for every extra FTE food safety employee, the proportion of successful responses is 0.12 percentage points lower, 95% CI [-0.65, 0.41]. FTE food safety employees explain 0.06% of the variance in the proportion of successful responses.

A: for every extra FTE food safety employee, the proportion of successful responses is 0.15 percentage points higher, 95% CI [-0.18, 0.47]. FTE food safety employees explain 0.25% of the variance in the proportion of successful responses.

B: for every extra FTE food safety employee, the proportion of successful responses is 0.27 percentage points higher, 95% CI [-0.02, 0.56]. FTE food safety employees explain 0.98% of the variance in the proportion of successful responses.

C: for every extra FTE food safety employee, the proportion of successful responses is 0.02 percentage points lower, 95% CI [-0.42, 0.37]. FTE food safety employees explain less than 0.01% of the variance in the proportion of successful responses.

D: for every extra FTE food safety employee, the proportion of successful responses is 1.11 percentage points lower, 95% CI [-1.72, -0.51]. FTE food safety employees explain 3.69% of the variance in the proportion of successful responses.

E: for every extra FTE food safety employee, the proportion of successful responses is 0.31 percentage points higher, 95% CI [-0.72, 1.33]. FTE food safety employees explain 0.10% of the variance in the proportion of successful responses.

According to the analysis above, there is no significant relationship between FTE food safety employees and the overall proportion of successful responses: the slope is not different from zero, t(345)=-0.447, p=0.655. The one exception is establishments rated D, where the relationship is significantly negative. Overall, it is not recommended to use FTE food safety employees to predict the proportion of successful responses.
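The slopes, confidence intervals, t statistics, and percentages of variance quoted above can all be read from the fitted model objects. A minimal sketch, using the D-rated model fitted in Section 1 (variable names as defined there):

```r
# Pull the reported quantities out of one fitted model (here the D-rated model)
s     <- summary(m.D.scresp.by.emply)
slope <- coef(m.D.scresp.by.emply)[2]       # -1.11: change per extra FTE employee
ci    <- confint(m.D.scresp.by.emply)[2, ]  # 95% CI [-1.72, -0.51]
tval  <- s$coefficients[2, "t value"]       # t statistic on 345 df
r2pct <- 100 * s$r.squared                  # 3.69% of variance explained
```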

Next, we fit a regression to examine the relationship between the proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority (i.e., employees per establishment).

## `geom_smooth()` using formula = 'y ~ x'

We then look at the same relationship separately for establishments rated A to E.

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

From the results of the models:

Overall (A to E): for every one-unit increase in employees per establishment, the proportion of successful responses is 2407.40 percentage points higher, 95% CI [1193.72, 3621.08]; since this predictor only ranges over small fractions, this corresponds to about 24.07 points per extra 0.01 employees per establishment. Employees per establishment explain 4.22% of the variance in the proportion of successful responses.

A: for every one-unit increase in employees per establishment, the proportion of successful responses is 533.29 percentage points lower, 95% CI [-1289.98, 223.39]. Employees per establishment explain 0.60% of the variance in the proportion of successful responses.

B: for every one-unit increase in employees per establishment, the proportion of successful responses is 128.59 percentage points higher, 95% CI [-555.06, 812.25]. Employees per establishment explain 0.04% of the variance in the proportion of successful responses.

C: for every one-unit increase in employees per establishment, the proportion of successful responses is 1169.56 percentage points higher, 95% CI [250.03, 2089.10]. Employees per establishment explain 1.78% of the variance in the proportion of successful responses.

D: for every one-unit increase in employees per establishment, the proportion of successful responses is 1736.63 percentage points higher, 95% CI [302.88, 3170.38]. Employees per establishment explain 1.62% of the variance in the proportion of successful responses.

E: for every one-unit increase in employees per establishment, the proportion of successful responses is 4444.40 percentage points higher, 95% CI [2080.12, 6808.69]. Employees per establishment explain 3.81% of the variance in the proportion of successful responses.

According to the analysis above, there is a significant relationship between the number of employees as a proportion of the number of rated establishments (overall) and the proportion of successful responses: the slope is significantly different from zero, t(345)=3.90, p<.001. It is therefore reasonable to use employees per rated establishment to predict the proportion of successful responses.
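Because Propofemplyest takes only small fractional values, the raw slope is hard to read. One way to put it on a more interpretable scale (a sketch, assuming the overall model fitted above) is to rescale to the effect of 0.01 extra employees per establishment:

```r
# Effect of 0.01 extra FTE employees per establishment on the success percentage
0.01 * coef(m.scresp.by.propemplyest)["Propofemplyest"]       # about 24.07 points
0.01 * confint(m.scresp.by.propemplyest)["Propofemplyest", ]  # CI about [11.94, 36.21]
```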


Question 2

Section 1

Import Libraries

# install.packages('Hmisc')
library(tidyverse)
library(emmeans)
library(gridExtra)
library(knitr)
library(kableExtra)
library(Hmisc)
library(RColorBrewer)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
options(width=100)

Data Preparation

# Import dataset and transform character variables into factor
pub_df <- read.csv('publisher_sales.csv', stringsAsFactors = TRUE)

# Summarise data for more information and check data types to ensure appropriateness
str(pub_df)
## 'data.frame':    6000 obs. of  7 variables:
##  $ sold.by       : Factor w/ 13 levels "Amazon Digital Services,  Inc.",..: 11 1 1 1 13 13 1 6 1 1 ...
##  $ publisher.type: Factor w/ 5 levels "amazon","big five",..: 2 3 5 5 2 2 5 2 5 5 ...
##  $ genre         : Factor w/ 3 levels "childrens","fiction",..: 1 3 3 2 1 1 2 1 2 1 ...
##  $ avg.review    : num  4.44 4.19 3.71 4.72 4.65 4.81 4.33 4.21 3.95 4.66 ...
##  $ daily.sales   : num  61.5 74.9 66 85.2 37.7 ...
##  $ total.reviews : int  92 130 118 179 111 106 205 86 161 81 ...
##  $ sale.price    : num  8.03 9.08 9.48 12.32 5.78 ...
summary(pub_df)
##                                  sold.by           publisher.type         genre     
##  Amazon Digital Services,  Inc.      :4311   amazon       :  78   childrens  :2000  
##  Random House LLC                    : 442   big five     :1658   fiction    :2000  
##  Penguin Group (USA) LLC             : 307   indie        :1330   non_fiction:2000  
##  Simon and Schuster Digital Sales Inc: 272   single author: 809                     
##  HarperCollins Publishers            : 261   small/medium :2125                     
##  Macmillan                           : 175                                          
##  (Other)                             : 232                                          
##    avg.review     daily.sales     total.reviews     sale.price    
##  Min.   :0.000   Min.   : -0.53   Min.   :  0.0   Min.   : 0.740  
##  1st Qu.:4.100   1st Qu.: 56.77   1st Qu.:105.0   1st Qu.: 7.140  
##  Median :4.400   Median : 74.29   Median :128.0   Median : 8.630  
##  Mean   :4.267   Mean   : 79.11   Mean   :132.6   Mean   : 8.641  
##  3rd Qu.:4.620   3rd Qu.: 98.02   3rd Qu.:163.0   3rd Qu.:10.160  
##  Max.   :4.980   Max.   :207.98   Max.   :248.0   Max.   :17.460  
## 
# Remove a row with an impossible negative daily.sales value
pub_df <- pub_df %>% filter(daily.sales >= 0)

summary(pub_df)
##                                  sold.by           publisher.type         genre     
##  Amazon Digital Services,  Inc.      :4310   amazon       :  78   childrens  :2000  
##  Random House LLC                    : 442   big five     :1658   fiction    :2000  
##  Penguin Group (USA) LLC             : 307   indie        :1329   non_fiction:1999  
##  Simon and Schuster Digital Sales Inc: 272   single author: 809                     
##  HarperCollins Publishers            : 261   small/medium :2125                     
##  Macmillan                           : 175                                          
##  (Other)                             : 232                                          
##    avg.review     daily.sales     total.reviews     sale.price    
##  Min.   :0.000   Min.   :  3.49   Min.   :  0.0   Min.   : 0.740  
##  1st Qu.:4.100   1st Qu.: 56.78   1st Qu.:105.0   1st Qu.: 7.140  
##  Median :4.400   Median : 74.30   Median :128.0   Median : 8.630  
##  Mean   :4.267   Mean   : 79.12   Mean   :132.6   Mean   : 8.641  
##  3rd Qu.:4.620   3rd Qu.: 98.02   3rd Qu.:163.0   3rd Qu.:10.160  
##  Max.   :4.980   Max.   :207.98   Max.   :248.0   Max.   :17.460  
## 
# Confirm that there are no NA values
sum(is.na(pub_df))
## [1] 0

Examine whether books from different genres have different daily sales on average

One-Way ANOVA

pub_df %>% group_by(genre) %>% summarise(n())
## # A tibble: 3 × 2
##   genre       `n()`
##   <fct>       <int>
## 1 childrens    2000
## 2 fiction      2000
## 3 non_fiction  1999
m.sales.by.genre <- lm(daily.sales~genre, data=pub_df)
anova(m.sales.by.genre)
## Analysis of Variance Table
## 
## Response: daily.sales
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## genre        2 2562023 1281012  2594.7 < 2.2e-16 ***
## Residuals 5996 2960293     494                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
( m.sales.by.genre.emm <- emmeans(m.sales.by.genre, ~genre) )
##  genre       emmean    SE   df lower.CL upper.CL
##  childrens     55.6 0.497 5996     54.6     56.6
##  fiction      105.9 0.497 5996    104.9    106.9
##  non_fiction   75.9 0.497 5996     74.9     76.9
## 
## Confidence level used: 0.95
( m.sales.by.genre.pairs <- confint(pairs(m.sales.by.genre.emm)) )
##  contrast                estimate    SE   df lower.CL upper.CL
##  childrens - fiction        -50.3 0.703 5996    -52.0    -48.7
##  childrens - non_fiction    -20.3 0.703 5996    -22.0    -18.7
##  fiction - non_fiction       30.0 0.703 5996     28.3     31.6
## 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates

Presenting a One-Way ANOVA

p.sales <- ggplot(summary(m.sales.by.genre.emm), aes(x=genre, y=emmean, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + labs(x="Genres ", y="Daily Sales", title = "CI for Daily Sales by Genres", subtitle="Error Bars are Extent of 95% CIs")

p.contrasts <- ggplot(m.sales.by.genre.pairs, aes(x=contrast, y=estimate, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + geom_hline(yintercept=0, lty=2) + labs(x="Contrast", y="Difference in Average Daily Sales", title = "CI for Daily Sales by  \n Difference between Genres", subtitle="Error Bars are Extent of 95% CIs") + theme(axis.text.x = element_text(angle = 10))

grid.arrange(p.sales, p.contrasts, ncol=2)

Check whether books have more/fewer sales depending upon their average review scores and total number of reviews.

# Predicting average daily sales from both average review score and total number of reviews
m.sales.by.avg.total.review <- lm(daily.sales ~ avg.review + total.reviews, data = pub_df)
summary(m.sales.by.avg.total.review)
## 
## Call:
## lm(formula = daily.sales ~ avg.review + total.reviews, data = pub_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.407  -14.656   -1.071   13.672  122.177 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   24.123430   2.340120  10.309  < 2e-16 ***
## avg.review    -3.999637   0.512874  -7.798 7.34e-15 ***
## total.reviews  0.543327   0.007816  69.517  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.58 on 5996 degrees of freedom
## Multiple R-squared:  0.4463, Adjusted R-squared:  0.4461 
## F-statistic:  2416 on 2 and 5996 DF,  p-value: < 2.2e-16
# Check multicollinearity again (vif() is from the car package)
vif(m.sales.by.avg.total.review)
##    avg.review total.reviews 
##      1.011022      1.011022
# Show the effect of one variable with the other variables held constant
sales.preds <- expand.grid(avg.review = seq(0, 5, 1), total.reviews = seq(0, 250, 5))

sales.preds <- mutate(sales.preds, Sales.hat = predict(m.sales.by.avg.total.review, sales.preds))

ggplot(sales.preds, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = Sales.hat)) + guides(fill=guide_legend(title="Sales"))

# Get the confidence intervals for the estimation approach
cbind(coefficient=coef(m.sales.by.avg.total.review), confint(m.sales.by.avg.total.review))
##               coefficient      2.5 %     97.5 %
## (Intercept)     24.123430 19.5359532 28.7109077
## avg.review      -3.999637 -5.0050539 -2.9942200
## total.reviews    0.543327  0.5280054  0.5586487
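As an illustration (not part of the original output), the fitted equation can be applied by hand. The coefficients below are rounded from the summary above, and the book's review values are hypothetical:

```r
# Hedged sketch: manual prediction from rounded coefficients of
# m.sales.by.avg.total.review, for a hypothetical book with an
# average review score of 4 and 100 total reviews
b0 <- 24.123; b.avg <- -4.000; b.total <- 0.543
sales.hat <- b0 + b.avg * 4 + b.total * 100
sales.hat  # about 62.4 predicted daily sales
```

This agrees, up to rounding, with `predict(m.sales.by.avg.total.review, ...)` on the same values.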
# interaction
m.intr <- lm(daily.sales ~ avg.review*total.reviews, data = pub_df)
summary(m.intr)
## 
## Call:
## lm(formula = daily.sales ~ avg.review * total.reviews, data = pub_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.084  -14.641   -0.946   13.822   92.351 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               63.676679   4.174384  15.254  < 2e-16 ***
## avg.review               -13.709806   0.992277 -13.817  < 2e-16 ***
## total.reviews              0.165853   0.034038   4.873 1.13e-06 ***
## avg.review:total.reviews   0.091422   0.008028  11.388  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.34 on 5995 degrees of freedom
## Multiple R-squared:  0.458,  Adjusted R-squared:  0.4577 
## F-statistic:  1689 on 3 and 5995 DF,  p-value: < 2.2e-16
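To see what the positive interaction term means in practice, the slope of total.reviews implied by the model can be evaluated at different review scores. This is an illustrative sketch using coefficients rounded from the summary above:

```r
# Slope of total.reviews at a given average review score, from the
# rounded interaction-model coefficients: 0.166 + 0.091 * avg.review
slope.at <- function(avg.review) 0.166 + 0.091 * avg.review
slope.at(1)  # extra reviews add little for low-rated books
slope.at(5)  # extra reviews add far more for highly rated books
```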
intr.surf <- expand.grid(avg.review = seq(0, 5, 1), total.reviews = seq(0, 250, 5))

intr.surf <- mutate(intr.surf, intr.hat = predict(m.intr, intr.surf))

ggplot(intr.surf, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = intr.hat)) + labs(subtitle = "Interaction Effects of Avg.review and Total.reviews") + guides(fill=guide_legend(title="Sales"))

Examine the effect of sale price upon the number of sales

# Calculate correlation coefficients
rcorr(as.matrix(select(pub_df, daily.sales, sale.price)))
##             daily.sales sale.price
## daily.sales        1.00      -0.28
## sale.price        -0.28       1.00
## 
## n= 5999 
## 
## 
## P
##             daily.sales sale.price
## daily.sales              0        
## sale.price   0
ggplot(pub_df, aes(x=sale.price, y=daily.sales)) + geom_point(alpha = 0.1) + labs(y="Number of Sales Daily", x="Sale Price", title=expression(r==-0.28), subtitle="The effect of sale price upon the number of sales") + geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

m.dsales.by.price <- lm(daily.sales~sale.price, data=pub_df)
summary(m.dsales.by.price)
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = pub_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.760 -20.644  -4.638  17.084 130.301 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 112.0540     1.5201   73.72   <2e-16 ***
## sale.price   -3.8110     0.1704  -22.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.15 on 5997 degrees of freedom
## Multiple R-squared:  0.07696,    Adjusted R-squared:  0.0768 
## F-statistic:   500 on 1 and 5997 DF,  p-value: < 2.2e-16
# Get the confidence intervals for the estimation approach
cbind(coefficient = coef(m.dsales.by.price), confint(m.dsales.by.price))
##             coefficient      2.5 %     97.5 %
## (Intercept)  112.054023 109.074077 115.033968
## sale.price    -3.810984  -4.145101  -3.476867
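As a quick check on these coefficients (illustrative only; the price value is hypothetical), the fitted line predicts:

```r
# Predicted daily sales at a hypothetical sale price of 10, using
# coefficients rounded from the summary of m.dsales.by.price
b0 <- 112.054; b.price <- -3.811
b0 + b.price * 10  # about 73.9 predicted daily sales
```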

Is this (the effect of sale price upon the number of sales) different across genres?

ggplot(pub_df, aes(x=sale.price, y=daily.sales)) + geom_point(alpha = 0.1) + labs(y="Number of Sales Daily", x="Sale Price", subtitle="The effect of sale price upon the number of sales") + geom_smooth(method=lm) + facet_wrap(~ genre)
## `geom_smooth()` using formula = 'y ~ x'

# Filter by genre
child_df <- pub_df %>%
  filter(genre == "childrens")

fic_df <- pub_df %>%
  filter(genre == "fiction")

nonf_df <- pub_df %>%
  filter(genre == "non_fiction")
# Calculate correlation coefficients
rcorr(as.matrix(select(child_df, daily.sales, sale.price)))
##             daily.sales sale.price
## daily.sales        1.00      -0.23
## sale.price        -0.23       1.00
## 
## n= 2000 
## 
## 
## P
##             daily.sales sale.price
## daily.sales              0        
## sale.price   0
rcorr(as.matrix(select(fic_df, daily.sales, sale.price)))
##             daily.sales sale.price
## daily.sales        1.00      -0.02
## sale.price        -0.02       1.00
## 
## n= 2000 
## 
## 
## P
##             daily.sales sale.price
## daily.sales             0.4203    
## sale.price  0.4203
rcorr(as.matrix(select(nonf_df, daily.sales, sale.price)))
##             daily.sales sale.price
## daily.sales        1.00      -0.04
## sale.price        -0.04       1.00
## 
## n= 1999 
## 
## 
## P
##             daily.sales sale.price
## daily.sales             0.0539    
## sale.price  0.0539
# Build linear regression models
m.child.dsales.by.price <- lm(daily.sales~sale.price, data=child_df)
summary(m.child.dsales.by.price)
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = child_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.929  -9.503  -0.030   9.585  56.579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  72.8781     1.6336   44.61   <2e-16 ***
## sale.price   -1.7319     0.1603  -10.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 1998 degrees of freedom
## Multiple R-squared:  0.0552, Adjusted R-squared:  0.05472 
## F-statistic: 116.7 on 1 and 1998 DF,  p-value: < 2.2e-16
m.fic.dsales.by.price <- lm(daily.sales~sale.price, data=fic_df)
summary(m.fic.dsales.by.price)
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = fic_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.383  -20.066    0.179   20.095  102.366 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 108.0774     2.7970  38.640   <2e-16 ***
## sale.price   -0.2732     0.3389  -0.806     0.42    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.34 on 1998 degrees of freedom
## Multiple R-squared:  0.000325,   Adjusted R-squared:  -0.0001753 
## F-statistic: 0.6496 on 1 and 1998 DF,  p-value: 0.4203
m.nonf.by.price <- lm(daily.sales~sale.price, data=nonf_df)
summary(m.nonf.by.price)
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = nonf_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.526 -13.381  -0.048  13.322  86.960 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.2755     1.8043  43.936   <2e-16 ***
## sale.price   -0.4262     0.2210  -1.929   0.0539 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.97 on 1997 degrees of freedom
## Multiple R-squared:  0.001859,   Adjusted R-squared:  0.001359 
## F-statistic: 3.719 on 1 and 1997 DF,  p-value: 0.05393
# Get the confidence intervals for the estimation approach
cbind(coef(m.child.dsales.by.price), confint(m.child.dsales.by.price))
##                           2.5 %    97.5 %
## (Intercept) 72.878117 69.674294 76.081941
## sale.price  -1.731864 -2.046235 -1.417494
cbind(coef(m.fic.dsales.by.price), confint(m.fic.dsales.by.price))
##                               2.5 %      97.5 %
## (Intercept) 108.0773903 102.5919734 113.5628073
## sale.price   -0.2731555  -0.9378044   0.3914934
cbind(coef(m.nonf.by.price), confint(m.nonf.by.price))
##                             2.5 %      97.5 %
## (Intercept) 79.2754689 75.7369156 82.81402222
## sale.price  -0.4262021 -0.8596162  0.00721202
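An alternative to fitting three separate per-genre models would be a single model with a sale.price-by-genre interaction, which tests directly whether the slopes differ. This is a hedged sketch on synthetic data (the `demo` data frame and its effect sizes are invented for illustration; with the real data the call would be `lm(daily.sales ~ sale.price * genre, data = pub_df)`):

```r
set.seed(1)
# Synthetic stand-in for pub_df, with a steeper price slope for childrens
demo <- data.frame(
  genre = factor(rep(c("childrens", "fiction", "non_fiction"), each = 50)),
  sale.price = runif(150, 1, 20)
)
demo$daily.sales <- 80 -
  ifelse(demo$genre == "childrens", 1.7, 0.4) * demo$sale.price +
  rnorm(150, 0, 5)
# The sale.price:genre terms estimate how much each genre's slope
# differs from the baseline (childrens) slope
m.demo <- lm(daily.sales ~ sale.price * genre, data = demo)
coef(m.demo)
```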

Section 2

This report presents the results of analyses of the publishing company's dataset. The data originally contained 6,000 sales records; one inaccurate record was removed, leaving 5,999.

First, we look at the summary of the sales records distinguished by genres.
Table 1. Summary of sales records across genres
genre n()
childrens 2000
fiction 2000
non_fiction 1999

A one-way ANOVA tests whether genre has a significant effect on daily sales. Daily sales differ significantly across book genres, F(2, 5996) = 2594.7, p < .001.

It was also found that fiction books have the highest average daily sales (105.9), followed by non-fiction (75.9), with children's books lowest (55.6). The estimated difference in daily sales for each pair of genres is shown in the right panel below. The childrens-fiction comparison has the largest point estimate, 50.3 more daily sales for fiction, 95% CI [48.7, 52.0]; the childrens-non_fiction comparison gives 20.3 more daily sales for non-fiction, 95% CI [18.7, 22.0]; and the fiction-non_fiction comparison gives 30.0 more daily sales for fiction, 95% CI [28.3, 31.6].

Next, a model was built to examine whether books have more or fewer sales depending upon their average review scores and total number of reviews.

From the results, we can see that average daily sales decrease by a statistically significant 4.00, t(5996) = -7.80, p < .0001, for every one-point increase in average review score, holding the total number of reviews constant.

When controlling for average review score, average daily sales increase by 0.54 for every extra review, which is significantly different from zero, t(5996) = 69.52, p < .0001.

We also report confidence intervals, following the estimation approach, because they give more information. Average daily sales change by -4.00, 95% CI [-5.01, -2.99], for every extra point of average review score, and by +0.54, 95% CI [0.53, 0.56], for every extra review.

Sometimes the effect of a predictor differs depending upon the value of another variable, so a further model was built to test the interaction between average review score and total number of reviews. The result shows a significant positive interaction between them: the higher the average review score, the larger the increase in sales produced by each additional review.


Next, a model was built to examine the effect a unit change in sale price (on the x axis) has on the number of sales. From the result, we see that daily sales decrease by 3.81 for every extra unit of sale price, and sale price explains 7.7% of the variance in average daily sales. The decrease is significantly different from zero, t(5997) = -22.36, p < .0001. It is therefore reasonable to use sale price to predict average daily sales.


The correlations between sale price and average daily sales are negative in all genres, indicating that higher prices are associated with lower sales regardless of genre, although the relationships are weak. For the childrens genre, every extra unit of sale price is associated with 1.73 fewer daily sales, 95% CI [-2.05, -1.42], and sale price explains 5.52% of the variance in daily sales; for fiction, 0.27 fewer daily sales, 95% CI [-0.94, 0.39], explaining 0.03% of the variance; and for non_fiction, 0.43 fewer daily sales, 95% CI [-0.86, 0.01], explaining 0.19% of the variance.

Based on these linear models, only for the childrens genre is the effect of sale price upon average daily sales significant, t(1998) = -10.8, p < .0001. The effects for the other two genres are not significant.